Skip to content

Conversation

@luoling8192
Copy link
Collaborator

No description provided.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @luoling8192, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a new feature for document processing and storage. It introduces a dedicated "documents" table in the database to house scraped web content, along with its processed versions, summaries, and vector embeddings for efficient retrieval. The existing link resolver has been enhanced to automatically detect URLs in chat messages, scrape their content, and persist this information, enabling richer data analysis and search capabilities within the application.

Highlights

  • Document Storage: Introduced a new "documents" table in the database to store scraped web content, summaries, and vector embeddings.
  • Web Scraping Integration: The "link-resolver" now automatically scrapes content from URLs found in chat messages using r.jina.ai and stores it in the new "documents" table.
  • New Dependencies: Added @mendable/firecrawl-js to facilitate web scraping, along with its associated transitive dependencies.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@luoling8192 luoling8192 marked this pull request as draft September 16, 2025 17:52
@netlify
Copy link

netlify bot commented Sep 16, 2025

Deploy Preview for tgsearch ready!

Name Link
🔨 Latest commit 19e598c
🔍 Latest deploy log https://app.netlify.com/projects/tgsearch/deploys/68c9a3a93114af0008bbfe53
😎 Deploy Preview https://deploy-preview-355--tgsearch.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new documents table and related functionality to scrape and store content from links found in messages. It adds a link resolver using Jina's reader service and a new dependency firecrawl-js. While this is a good feature addition, the review has identified several critical issues in the implementation of document retrieval, potential data integrity problems with foreign keys, and areas for improvement in the link resolver and schema definitions. Addressing these points will improve the correctness, robustness, and maintainability of the new feature.

const relevantDocs = (await withDb(async db => db
.select()
.from(documentsTable)
.where(sql`${documentsTable.processed_content_jieba_tokens} @> ${content.text}`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The text search query sql${documentsTable.processed_content_jieba_tokens} @> ${content.text}`` is incorrect. The content.text is a raw string, but a `jsonb` containment query (`@>`) on `processed_content_jieba_tokens` expects a JSONB array of tokens. The input text should be tokenized first (e.g., using Jieba), and then the tokens should be passed to the SQL query as a JSON array string, for example: `sql`... @> ${JSON.stringify(tokens)}::jsonb`.

const relevantDocs = (await withDb(async db => db
.select()
.from(documentsTable)
.where(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The vector similarity operator <=> is being used in a WHERE clause. This operator calculates the distance and is typically used in an ORDER BY clause to sort results by similarity. Using it in a WHERE clause without a comparison is incorrect. You should use orderBy to sort by distance.

Suggested change
.where(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`)
.orderBy(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`)

Comment on lines +75 to +99
const documents = []

if (content.text) {
const relevantDocs = (await withDb(async db => db
.select()
.from(documentsTable)
.where(sql`${documentsTable.processed_content_jieba_tokens} @> ${content.text}`)
.limit(pagination?.limit || 10)
.offset(pagination?.offset || 0),
)).unwrap()
logger.withFields({ relevantDocs: relevantDocs.length }).verbose('Retrieved jieba documents')
documents.push(...relevantDocs)
}

if (content.embedding && content.embedding.length > 0) {
const relevantDocs = (await withDb(async db => db
.select()
.from(documentsTable)
.where(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`)
.limit(pagination?.limit || 10)
.offset(pagination?.offset || 0),
)).unwrap()
logger.withFields({ relevantDocs: relevantDocs.length }).verbose('Retrieved vector documents')
documents.push(...relevantDocs)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The retrieveDocuments function performs two separate database queries for text and vector search, and then concatenates the results. This approach has several issues:

  • Duplicates: It can return duplicate documents if a document matches both text and vector queries.
  • Incorrect Pagination: The limit and offset are applied to each query separately, not to the combined result set. This means you could get up to 2 * limit results, and pagination will not work as expected.
  • Inconsistent Ranking: The results from the two queries are not ranked against each other; they are simply appended. This can lead to less relevant results appearing before more relevant ones.

A better approach would be to combine these into a single query if possible, or fetch results and then merge, de-duplicate, and re-rank them in the application before applying pagination.


export const documentsTable = pgTable('documents', {
id: uuid().primaryKey().defaultRandom(),
chat_messages_id: uuid().notNull().references(() => chatMessagesTable.id),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The foreign key on chat_messages_id defaults to ON DELETE NO ACTION. This will cause an error when trying to delete a chat message that has associated documents, which can lead to data integrity issues or unexpected application behavior. If documents should be deleted along with their corresponding message, consider using onDelete: 'cascade'.

Suggested change
chat_messages_id: uuid().notNull().references(() => chatMessagesTable.id),
chat_messages_id: uuid().notNull().references(() => chatMessagesTable.id, { onDelete: 'cascade' }),

return Ok([] as CoreMessage[])
for (const message of opts.messages) {
if (message.content) {
const links = message.content.match(/https?:\/\/\S+/g)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The regex https?:\/\/\S+ for extracting links is a bit too greedy. It can incorrectly include trailing punctuation like commas, periods, or closing parentheses as part of the URL, which can lead to failed scraping attempts. Consider cleaning the matched URLs to remove such trailing characters.

if (links && links.length > 0) {
for (const link of links) {
try {
const response = await fetch(`https://r.jina.ai/${link}`, {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The URL for the Jina reader service is hardcoded. It's better to extract this into a constant or a configuration variable to make it easier to change or update in the future.

Comment on lines +36 to +42
await recordDocuments([{
chatMessagesId: message.uuid,
rawContent: doc,
processedContent: doc,
processedContentJiebaTokens: [],
summary: doc,
}])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The scraped document is used for rawContent, processedContent, and summary without any processing. Also, processedContentJiebaTokens is an empty array. For these fields to be effective for search and retrieval, the content should be processed (e.g., cleaned, tokenized) and a summary should be generated. This implementation seems like a placeholder and should be enhanced for the feature to be fully functional.

Comment on lines +25 to +30
}, table => [
index('documents_summary_vector_1536_index').using('hnsw', table.summary_vector_1536.op('vector_cosine_ops')),
index('documents_summary_vector_1024_index').using('hnsw', table.summary_vector_1024.op('vector_cosine_ops')),
index('documents_summary_vector_768_index').using('hnsw', table.summary_vector_768.op('vector_cosine_ops')),
index('documents_processed_content_jieba_tokens_index').using('gin', table.processed_content_jieba_tokens.op('jsonb_path_ops')),
])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The recordDocuments function in packages/core/src/models/documents.ts uses onConflictDoUpdate with chat_messages_id as the target. This implies that chat_messages_id should be unique. To enforce this at the database level and for clarity, you should add a unique index on this column.

Suggested change
}, table => [
index('documents_summary_vector_1536_index').using('hnsw', table.summary_vector_1536.op('vector_cosine_ops')),
index('documents_summary_vector_1024_index').using('hnsw', table.summary_vector_1024.op('vector_cosine_ops')),
index('documents_summary_vector_768_index').using('hnsw', table.summary_vector_768.op('vector_cosine_ops')),
index('documents_processed_content_jieba_tokens_index').using('gin', table.processed_content_jieba_tokens.op('jsonb_path_ops')),
])
}, table => [
uniqueIndex('documents_chat_messages_id_unique_index').on(table.chat_messages_id),
index('documents_summary_vector_1536_index').using('hnsw', table.summary_vector_1536.op('vector_cosine_ops')),
index('documents_summary_vector_1024_index').using('hnsw', table.summary_vector_1024.op('vector_cosine_ops')),
index('documents_summary_vector_768_index').using('hnsw', table.summary_vector_768.op('vector_cosine_ops')),
index('documents_processed_content_jieba_tokens_index').using('gin', table.processed_content_jieba_tokens.op('jsonb_path_ops')),
])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants